In today’s digital world, there is a growing need to secure each individual’s data and assets. Applications such as fingerprint sensing, other biometric sensing, and facial recognition protect a person’s digital or physical assets by granting access uniquely to that individual. To explore how facial recognition works in the context of access control, I undertook this Facial Keypoint Detection project. It uses Computer Vision techniques and Convolutional Neural Networks to predict facial keypoints on unseen face images.
The primary goal of this project is to get hands-on experience with PyTorch and to implement a project that is highly relevant to present-day devices and technology. I pursued it to explore how accurately we can predict the facial features of a face in a given picture, which matters both in the access-control domain and in studying the emotions depicted by facial expressions, which are in turn described by the facial keypoints. The project is also an opportunity to practice transformations commonly used in image processing and to inspect their effect on the results.
This project was implemented in PyTorch, a Python package widely used for research on neural networks, including Convolutional Neural Networks, an architecture well suited to image data. The tasks start with loading a dataset that is readily available on the internet: the YouTube Faces dataset, gathered by applying image-processing techniques to YouTube videos of celebrities and turning them into image frames annotated with facial keypoints. The data is then transformed to achieve uniformity across all images through resizing, cropping, and normalization. Using this data, we train a Convolutional Neural Network built from a customized stack of convolutional layers and parameters. We then evaluate it on the test split of the same dataset. Finally, we test it on independent images retrieved from other sources to assess performance on completely new data. The tasks are also listed below.
We start by importing the dataset, which consists primarily of image frames from YouTube videos of celebrities, annotated with purple dots on the facial features. These dots are given as (x, y) coordinates, and per the dataset specification there are 68 points marked on the face in each image. The dataset description can be seen at this link: https://www.cs.tau.ac.il/~wolf/ytfaces/
These images are then transformed through normalization, cropping, and rescaling because they come in different sizes and resolutions. Since our model can only be trained on tensors of the same size, we transform the images first.
We then visualize sample training images and design our CNN, as described below in the training step.
Moving ahead, we visualize images tested on the initial, untrained model; naturally, the results are expected to be poor. We then train our model, a CNN with max pooling, using ReLU as the activation function. The reasons for using ReLU are its ease of implementation and the sparsity it induces in the network's activations. It is a non-linear function that allows the network to learn complex patterns in our image data, which is important since we are trying to capture complex facial shapes and features. ReLU also mitigates the vanishing gradient problem that occurs with sigmoid and tanh.
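To illustrate why ReLU helps with vanishing gradients, we can compare its derivative with sigmoid's. This is a small NumPy sketch for illustration only, not part of the project code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.5, 5.0])

# ReLU passes positive activations through unchanged; its gradient is
# exactly 1 for positive inputs, so gradients do not shrink layer after
# layer. Negative inputs are zeroed, which sparsifies activations.
relu = np.maximum(0.0, x)
relu_grad = (x > 0).astype(float)

# The sigmoid gradient is at most 0.25 and decays toward 0 for large
# |x|, so repeated multiplication across layers shrinks gradients.
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))

print(relu_grad)               # [0. 0. 1. 1.]
print(sig_grad.max() <= 0.25)  # True
```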
Further, we train the model at different batch sizes and epoch counts. However, due to the computational limits of CPUs, we keep the batch size and number of epochs modest while still reaching a reasonable loss.
Finally, we run the trained model on our test images and visualize them to compare the ground-truth keypoints with the predicted keypoints. Lastly, I use some of my own pictures, including pictures from my personal collection with noise in the background, to inspect the results. To evaluate the model on independent images outside the test set, I use a Haar Cascade face detector and then predict keypoints on the faces it finds.
These keypoints feed into applications such as emotion recognition and face detection.
The program below starts by importing the data and the packages that we'll use in this project. These include:
The first step is to load the data, which we will then visualize. We start by downloading the data from an openly available AWS S3 source and unzipping the archive.
!mkdir /data
!wget -P /data/ https://s3.amazonaws.com/video.udacity-data.com/topher/2018/May/5aea1b91_train-test-data/train-test-data.zip
!unzip -n /data/train-test-data.zip -d /data
mkdir: cannot create directory ‘/data’: File exists
--2023-12-05 01:51:40--  https://s3.amazonaws.com/video.udacity-data.com/topher/2018/May/5aea1b91_train-test-data/train-test-data.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.81.246, 52.217.47.118, 54.231.230.64, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.81.246|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 338613624 (323M) [application/zip]
Saving to: ‘/data/train-test-data.zip.1’
train-test-data.zip 100%[===================>] 322.93M  16.3MB/s    in 22s
2023-12-05 01:52:03 (14.7 MB/s) - ‘/data/train-test-data.zip.1’ saved [338613624/338613624]
Archive:  /data/train-test-data.zip
Here we simply import all the relevant packages needed to visualize the data initially. We have used all of these packages in our Computer Vision course labs.
# import the required libraries
import glob
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
We'll take a sample of keypoint coordinates from an image in the training dataset we just loaded.
key_pts_frame = pd.read_csv('/data/training_frames_keypoints.csv')
n = 0
image_name = key_pts_frame.iloc[n, 0]
#key_pts = key_pts_frame.iloc[n, 1:].as_matrix()
key_pts = key_pts_frame.iloc[n, 1:].to_numpy()
key_pts = key_pts.astype('float').reshape(-1, 2)
print('Image name: ', image_name)
print('Landmarks shape: ', key_pts.shape)
print('First 4 key pts: {}'.format(key_pts[:4]))
Image name:  Luis_Fonsi_21.jpg
Landmarks shape:  (68, 2)
First 4 key pts: [[ 45.  98.]
 [ 47. 106.]
 [ 49. 110.]
 [ 53. 119.]]
To check the size of our data, we simply inspect the shape of the array to see how many image examples we have. A total of 3462 images are in the training set.
# print out some stats about the data
print('Number of images: ', key_pts_frame.shape[0])
Number of images: 3462
Here we visualize an image with its keypoint annotations. This gives a perspective on the set of facial-keypoint images that we will train on.
def show_keypoints(image, key_pts):
plt.imshow(image)
plt.scatter(key_pts[:, 0], key_pts[:, 1], s=20, marker='.', c='m')
n = 44
image_name = key_pts_frame.iloc[n, 0]
#key_pts = key_pts_frame.iloc[n, 1:].as_matrix()
key_pts = key_pts_frame.iloc[n, 1:].to_numpy()
key_pts = key_pts.astype('float').reshape(-1, 2)
plt.figure(figsize=(5, 5))
show_keypoints(mpimg.imread(os.path.join('/data/training/', image_name)), key_pts)
plt.show()
Following the PyTorch tutorials and official guide referenced in Lab 9 of the Computer Vision course, we create a Dataset class for our keypoints data. This class will also assist in applying transformations later in the project. It turns our imported data into a dataset that can be used for training.
from torch.utils.data import Dataset, DataLoader
class FacialKeypointsDataset(Dataset):
"""Face Landmarks dataset."""
def __init__(self, csv_file, root_dir, transform=None):
"""
Args:
csv_file (string): Path to the csv file with annotations.
root_dir (string): Directory with all the images.
transform (callable, optional): Optional transform to be applied
on a sample.
"""
self.key_pts_frame = pd.read_csv(csv_file)
self.root_dir = root_dir
self.transform = transform
def __len__(self):
return len(self.key_pts_frame)
def __getitem__(self, idx):
image_name = os.path.join(self.root_dir,
self.key_pts_frame.iloc[idx, 0])
image = mpimg.imread(image_name)
# if image has an alpha color channel, get rid of it
if(image.shape[2] == 4):
image = image[:,:,0:3]
# key_pts = self.key_pts_frame.iloc[idx, 1:].as_matrix()
key_pts = self.key_pts_frame.iloc[idx, 1:].to_numpy()
key_pts = key_pts.astype('float').reshape(-1, 2)
sample = {'image': image, 'keypoints': key_pts}
if self.transform:
sample = self.transform(sample)
return sample
Having defined the class, we'll display some images by constructing a dataset from the CSV file extracted from our zipped dataset.
# Construct the dataset
face_dataset = FacialKeypointsDataset(csv_file='/data/training_frames_keypoints.csv',
root_dir='/data/training/')
# print some stats about the dataset
print('Length of dataset: ', len(face_dataset))
Length of dataset: 3462
# Display a few of the images from the dataset
num_to_display = 3
for i in range(num_to_display):
# define the size of images
fig = plt.figure(figsize=(20,10))
# randomly select a sample
rand_i = np.random.randint(0, len(face_dataset))
sample = face_dataset[rand_i]
# print the shape of the image and keypoints
print(i, sample['image'].shape, sample['keypoints'].shape)
ax = plt.subplot(1, num_to_display, i + 1)
ax.set_title('Sample #{}'.format(i))
# Using the same display function, defined earlier
show_keypoints(sample['image'], sample['keypoints'])
0 (219, 195, 3) (68, 2)
1 (240, 256, 3) (68, 2)
2 (305, 239, 3) (68, 2)
The images are not all the same size; for the CNN to train well, they need to be standardized in size and normalized in color range.
One very important step is to convert each image from a NumPy array to a tensor, since PyTorch trains on tensor objects.
For this, we created classes that manipulate the images using the cv2 package. The first step is to convert each image to grayscale and scale the pixel values to the range [0, 1]. We also normalize the keypoints by centering them around 0: we subtract an approximate mean (100) and divide by an approximate standard deviation (50).
The keypoints and the images are normalized separately.
At this point, we import the transforms module from torchvision, which is commonly used for CNN data pipelines.
Similarly, we created classes to perform rescaling using cv2 and random cropping.
The key computer vision techniques we apply here are grayscale conversion, intensity normalization, rescaling, and random cropping.
Since PyTorch trains on tensors, a final transform also converts the data to tensors.
import torch
from torchvision import transforms, utils
# transforms
class Normalize(object):
"""Convert a color image to grayscale and normalize the color range to [0,1]."""
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
image_copy = np.copy(image)
key_pts_copy = np.copy(key_pts)
# convert image to grayscale
image_copy = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
# scale color range from [0, 255] to [0, 1]
image_copy= image_copy/255.0
# scale keypoints to be centered around 0 with a range of [-1, 1]
# approximate mean = 100, std = 50, so pts should be (pts - 100)/50
key_pts_copy = (key_pts_copy - 100)/50.0
return {'image': image_copy, 'keypoints': key_pts_copy}
class Rescale(object):
"""Rescale the image in a sample to a given size.
Args:
output_size (tuple or int): Desired output size. If tuple, output is
matched to output_size. If int, smaller of image edges is matched
to output_size keeping aspect ratio the same.
"""
def __init__(self, output_size):
assert isinstance(output_size, (int, tuple))
self.output_size = output_size
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
h, w = image.shape[:2]
if isinstance(self.output_size, int):
if h > w:
new_h, new_w = self.output_size * h / w, self.output_size
else:
new_h, new_w = self.output_size, self.output_size * w / h
else:
new_h, new_w = self.output_size
new_h, new_w = int(new_h), int(new_w)
img = cv2.resize(image, (new_w, new_h))
# scale the pts, too
key_pts = key_pts * [new_w / w, new_h / h]
return {'image': img, 'keypoints': key_pts}
class RandomCrop(object):
"""Crop randomly the image in a sample.
Args:
output_size (tuple or int): Desired output size. If int, square crop
is made.
"""
def __init__(self, output_size):
assert isinstance(output_size, (int, tuple))
if isinstance(output_size, int):
self.output_size = (output_size, output_size)
else:
assert len(output_size) == 2
self.output_size = output_size
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
h, w = image.shape[:2]
new_h, new_w = self.output_size
top = np.random.randint(0, h - new_h)
left = np.random.randint(0, w - new_w)
image = image[top: top + new_h,
left: left + new_w]
key_pts = key_pts - [left, top]
return {'image': image, 'keypoints': key_pts}
class ToTensor(object):
"""Convert ndarrays in sample to Tensors."""
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
# if image has no grayscale color channel, add one
if(len(image.shape) == 2):
# add that third color dim
image = image.reshape(image.shape[0], image.shape[1], 1)
# swap color axis because
# numpy image: H x W x C
# torch image: C X H X W
image = image.transpose((2, 0, 1))
return {'image': torch.from_numpy(image),
'keypoints': torch.from_numpy(key_pts)}
Here, I applied the transforms we created above to check that they work correctly. The images are shown below.
Here we need to test the transformations with correct parameters. For example, we can't crop an image to a size larger than the actual image.
Output Discussion:
We can see that if an image is cropped badly, the keypoints will fall on the wrong regions of the image and the training data will be highly inaccurate. Therefore, we must crop the images consistently, since the source images have different dimensions.
# test out some of these transforms
rescale = Rescale(100)
crop = RandomCrop(50)
composed = transforms.Compose([Rescale(250),
RandomCrop(224)])
# apply the transforms to a sample image
test_num = 300
sample = face_dataset[test_num]
fig = plt.figure()
for i, tx in enumerate([rescale, crop, composed]):
transformed_sample = tx(sample)
ax = plt.subplot(1, 3, i + 1)
plt.tight_layout()
ax.set_title(type(tx).__name__)
show_keypoints(transformed_sample['image'], transformed_sample['keypoints'])
plt.show()
Discussion: The displayed images show the effect of rescaling and cropping. We need to ensure the crop does not cut out facial features, as happened in the image above; otherwise the model trains on incorrect pixel-to-keypoint correspondences and outputs wrong keypoints on the test images.
We now import utilities to ensure our workspaces are active during training of the model.
# import utilities to keep workspaces alive during model training
from workspace_utils import active_session
# watch for any changes in model.py, if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
We created a separate Python module that defines the class Net used to build the model. Before we import it, the program is shown below:
The description of the CNN model we implement is as follows:
The model is a Convolutional Neural Network with three convolutional layers, defined in the Net class of a standalone Python module called 'models'.
The first layer convolves the single-channel input images with a kernel size of 5. It produces a 32-channel output, meaning 32 filters of size 5x5 are applied with a stride of 1.
In the second layer we take the 32 channels from the previous layer and apply 64 filters of size 3x3x32 to produce a 64-channel output. The stride remains 1.
Finally, the third layer applies 128 filters with a kernel of size 3x3x64. We increase the number of channels while reducing the height and width of each feature map. The motivation for more filters is to capture more intricate details of the facial features and better localize the keypoints on the face.
After each convolution we apply max pooling with a kernel size of 2 and a stride of 2 (no dilation), which halves the spatial dimensions of the feature maps.
We then use dropout, which randomly zeroes neurons or filters during training to avoid overfitting and improve generalization on unseen test data. Finally, we apply a linear layer with 86528 inputs (128 channels of 26x26 feature maps, flattened from the previous layer's output) and 136 outputs, i.e., 68 (x, y) keypoint pairs.
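As a sanity check on the fully connected layer's input size, the spatial dimensions can be traced layer by layer. This is a quick sketch, assuming the 224x224 single-channel inputs used in this notebook:

```python
# Trace feature-map sizes through the network for a 224x224 input.
# A convolution with kernel k (stride 1, no padding) shrinks each side
# by k - 1; 2x2 max pooling with stride 2 halves it (floor division).

def conv(side, k):
    return side - (k - 1)

def pool(side):
    return side // 2

side = 224
side = pool(conv(side, 5))  # conv1 (k=5): 224 -> 220, pool -> 110
side = pool(conv(side, 3))  # conv2 (k=3): 110 -> 108, pool -> 54
side = pool(conv(side, 3))  # conv3 (k=3): 54 -> 52,  pool -> 26

flattened = 128 * side * side
print(side, flattened)  # 26 86528, matching fc1's in_features
```

This matches the `in_features=86528` shown in the printed model summary below.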
import torch
import torch.nn as nn
import torch.nn.functional as F
# can use the below import should choose to initialize the weights of your Net
import torch.nn.init as I
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 32, 5)
self.conv2 = nn.Conv2d(32, 64, 3)
self.conv3 = nn.Conv2d(64, 128, 3)
self.pool = nn.MaxPool2d(2,2)
self.drop2d = nn.Dropout2d(p=0.2, inplace=False)
self.drop = nn.Dropout(p=0.2, inplace=False)
self.fc1 = nn.Linear(128*26*26, 136)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.drop2d(x)
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))
x = self.drop(x)
x = x.view(x.size(0), -1)
x = self.fc1(x)
return x
import torch
import torch.nn as nn
import torch.nn.functional as F
from models import Net
net = Net()
print(net)
Net(
  (conv1): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (drop2d): Dropout2d(p=0.2, inplace=False)
  (drop): Dropout(p=0.2, inplace=False)
  (fc1): Linear(in_features=86528, out_features=136, bias=True)
)
Here I transformed the dataset of images and keypoints. The transformation is implemented in a separate Python module and executed here. The transformations are as follows:
The standalone module, called data_load, defines the FacialKeypointsDataset class, which loads the data we initially imported from the AWS dataset. It also defines the classes for image transformation.
#Data_load python program
import glob
import os
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.image as mpimg
import pandas as pd
import cv2
class FacialKeypointsDataset(Dataset):
"""Face Landmarks dataset."""
def __init__(self, csv_file, root_dir, transform=None):
"""
Args:
csv_file (string): Path to the csv file with annotations.
root_dir (string): Directory with all the images.
transform (callable, optional): Optional transform to be applied
on a sample.
"""
self.key_pts_frame = pd.read_csv(csv_file)
self.root_dir = root_dir
self.transform = transform
def __len__(self):
return len(self.key_pts_frame)
def __getitem__(self, idx):
image_name = os.path.join(self.root_dir,
self.key_pts_frame.iloc[idx, 0])
image = mpimg.imread(image_name)
# if image has an alpha color channel, get rid of it
if(image.shape[2] == 4):
image = image[:,:,0:3]
#key_pts = self.key_pts_frame.iloc[idx, 1:].as_matrix()
key_pts = self.key_pts_frame.iloc[idx, 1:].to_numpy()
key_pts = key_pts.astype('float').reshape(-1, 2)
sample = {'image': image, 'keypoints': key_pts}
if self.transform:
sample = self.transform(sample)
return sample
# transforms
class Normalize(object):
"""Convert a color image to grayscale and normalize the color range to [0,1]."""
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
image_copy = np.copy(image)
key_pts_copy = np.copy(key_pts)
# convert image to grayscale
image_copy = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
# scale color range from [0, 255] to [0, 1]
image_copy= image_copy/255.0
# scale keypoints to be centered around 0 with a range of [-1, 1]
# approximate mean = 100, std = 50, so pts should be (pts - 100)/50
key_pts_copy = (key_pts_copy - 100)/50.0
return {'image': image_copy, 'keypoints': key_pts_copy}
class Rescale(object):
"""Rescale the image in a sample to a given size.
Args:
output_size (tuple or int): Desired output size. If tuple, output is
matched to output_size. If int, smaller of image edges is matched
to output_size keeping aspect ratio the same.
"""
def __init__(self, output_size):
assert isinstance(output_size, (int, tuple))
self.output_size = output_size
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
h, w = image.shape[:2]
if isinstance(self.output_size, int):
if h > w:
new_h, new_w = self.output_size * h / w, self.output_size
else:
new_h, new_w = self.output_size, self.output_size * w / h
else:
new_h, new_w = self.output_size
new_h, new_w = int(new_h), int(new_w)
img = cv2.resize(image, (new_w, new_h))
# scale the pts, too
key_pts = key_pts * [new_w / w, new_h / h]
return {'image': img, 'keypoints': key_pts}
class RandomCrop(object):
"""Crop randomly the image in a sample.
Args:
output_size (tuple or int): Desired output size. If int, square crop
is made.
"""
def __init__(self, output_size):
assert isinstance(output_size, (int, tuple))
if isinstance(output_size, int):
self.output_size = (output_size, output_size)
else:
assert len(output_size) == 2
self.output_size = output_size
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
h, w = image.shape[:2]
new_h, new_w = self.output_size
top = np.random.randint(0, h - new_h)
left = np.random.randint(0, w - new_w)
image = image[top: top + new_h,
left: left + new_w]
key_pts = key_pts - [left, top]
return {'image': image, 'keypoints': key_pts}
class ToTensor(object):
"""Convert ndarrays in sample to Tensors."""
def __call__(self, sample):
image, key_pts = sample['image'], sample['keypoints']
# if image has no grayscale color channel, add one
if(len(image.shape) == 2):
# add that third color dim
image = image.reshape(image.shape[0], image.shape[1], 1)
# swap color axis because
# numpy image: H x W x C
# torch image: C X H X W
image = image.transpose((2, 0, 1))
return {'image': torch.from_numpy(image),
'keypoints': torch.from_numpy(key_pts)}
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
from data_load import FacialKeypointsDataset
from data_load import Rescale, RandomCrop, Normalize, ToTensor
data_transform = transforms.Compose([Rescale(250),
RandomCrop(224),
Normalize(),
ToTensor()])
assert(data_transform is not None), 'Define a data_transform'
We now transform our dataset by applying the transformations as follows:
Finally, I convert the data to tensors so PyTorch has the right data structure for training.
# create the transformed dataset
transformed_dataset = FacialKeypointsDataset(csv_file='/data/training_frames_keypoints.csv',
root_dir='/data/training/',
transform=data_transform)
print('Number of images: ', len(transformed_dataset))
# iterate through the transformed dataset and print some stats about the first few samples
for i in range(4):
sample = transformed_dataset[i]
print(i, sample['image'].size(), sample['keypoints'].size())
Number of images:  3462
0 torch.Size([1, 224, 224]) torch.Size([68, 2])
1 torch.Size([1, 224, 224]) torch.Size([68, 2])
2 torch.Size([1, 224, 224]) torch.Size([68, 2])
3 torch.Size([1, 224, 224]) torch.Size([68, 2])
Here, I have set a batch size of 10 for the training set of 3462 images. This is a reasonable starting point given the limited computational resources, in particular the initial absence of a GPU.
# load training data in batches
batch_size = 10
train_loader = DataLoader(transformed_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=0)
Here I simply tested images on the model before training. As expected, the results are very poor; the keypoints do not match the face at all. I will compare this with the final results.
I have also loaded the test dataset here.
# load in the test data, using the dataset class
# AND apply the data_transform you defined above
# create the test dataset
test_dataset = FacialKeypointsDataset(csv_file='/data/test_frames_keypoints.csv',
root_dir='/data/test/',
transform=data_transform)
# load test data in batches
batch_size = 10
test_loader = DataLoader(test_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=0)
To test the model on a test sample of data, I followed these steps:
This function tests how the network performs on the first batch of test data. It returns the test images, the predicted keypoints (produced by the model), and the ground-truth keypoints.
# test the model on a batch of test images
def net_sample_output():
# iterate through the test dataset
for i, sample in enumerate(test_loader):
# get sample data: images and ground truth keypoints
images = sample['image']
key_pts = sample['keypoints']
# convert images to FloatTensors
images = images.type(torch.FloatTensor)
# forward pass to get net output
output_pts = net(images)
# reshape to batch_size x 68 x 2 pts
output_pts = output_pts.view(output_pts.size()[0], 68, -1)
# break after first image is tested
if i == 0:
return images, output_pts, key_pts
# call the above function
# returns: test images, test predicted keypoints, test ground truth keypoints
test_images, test_outputs, gt_pts = net_sample_output()
# print out the dimensions of the data to see if they make sense
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.size())
torch.Size([10, 1, 224, 224])
torch.Size([10, 68, 2])
torch.Size([10, 68, 2])
Once the model produces predicted keypoints, we can visualize them much as we displayed this data before, except this time we have to "un-transform" the image/keypoint data to display it.
def show_all_keypoints(image, predicted_key_pts, gt_pts=None):
"""Show image with predicted keypoints"""
# image is grayscale
plt.imshow(image, cmap='gray')
plt.scatter(predicted_key_pts[:, 0], predicted_key_pts[:, 1], s=20, marker='.', c='m')
# plot ground truth points as green pts
if gt_pts is not None:
plt.scatter(gt_pts[:, 0], gt_pts[:, 1], s=20, marker='.', c='g')
The function below takes the image batches with predicted keypoints and ground truth and displays them as images.
# visualize the output
# by default this shows a batch of 10 images
def visualize_output(test_images, test_outputs, gt_pts=None, batch_size=10):
for i in range(batch_size):
plt.figure(figsize=(20,10))
ax = plt.subplot(1, batch_size, i+1)
# un-transform the image data
image = test_images[i].data # get the image from its Variable wrapper
image = image.numpy() # convert to numpy array from a Tensor
image = np.transpose(image, (1, 2, 0)) # transpose to go from torch to numpy image
# un-transform the predicted key_pts data
predicted_key_pts = test_outputs[i].data
predicted_key_pts = predicted_key_pts.numpy()
# undo normalization of keypoints
predicted_key_pts = predicted_key_pts*50.0+100
# plot ground truth points for comparison, if they exist
ground_truth_pts = None
if gt_pts is not None:
ground_truth_pts = gt_pts[i]
ground_truth_pts = ground_truth_pts*50.0+100
# call show_all_keypoints
show_all_keypoints(np.squeeze(image), predicted_key_pts, ground_truth_pts)
plt.axis('off')
plt.show()
# call it
visualize_output(test_images, test_outputs, gt_pts)
## 12. Training
We have created a class in a standalone Python program, as in Lab 2 of the Computer Vision course. This class contains the CNN model that we apply.
We use a regression loss in this project instead of cross-entropy, since training a network to predict keypoints differs from training a network to predict a class. Instead of outputting a distribution over classes and using cross-entropy loss, we choose a loss function suited to regression, one that directly compares a predicted value and a target value.
#Define the loss and optimization
import torch.optim as optim
criterion = nn.SmoothL1Loss()
optimizer = optim.Adam(net.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
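As a quick check of what SmoothL1Loss computes: with its default beta of 1 it is quadratic for small errors and linear for large ones. A small sketch with hypothetical values, assuming torch is available:

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()  # mean reduction by default

pred = torch.tensor([0.5, 2.0])
target = torch.tensor([0.0, 0.0])

# Elementwise: 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise, then averaged.
# Here: 0.5*0.5^2 = 0.125 and 2.0 - 0.5 = 1.5, mean = 0.8125.
loss = criterion(pred, target)
print(loss.item())  # 0.8125

# Unlike plain MSE, the linear region keeps occasional large keypoint
# errors from dominating the gradient.
mse = nn.MSELoss()(pred, target)
print(mse.item())   # (0.25 + 4.0)/2 = 2.125
```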
The program below executes the training of the model on the training dataset.
def train_net(n_epochs):
# prepare the net for training
net.train()
for epoch in range(n_epochs): # loop over the dataset multiple times
running_loss = 0.0
# train on batches of data, assumes you already have train_loader
for batch_i, data in enumerate(train_loader):
# get the input images and their corresponding labels
images = data['image']
key_pts = data['keypoints']
# flatten pts
key_pts = key_pts.view(key_pts.size(0), -1)
# convert variables to floats for regression loss
key_pts = key_pts.type(torch.FloatTensor)
images = images.type(torch.FloatTensor)
# forward pass to get outputs
output_pts = net(images)
# calculate the loss between predicted and target keypoints
loss = criterion(output_pts, key_pts)
# zero the parameter (weight) gradients
optimizer.zero_grad()
# backward pass to calculate the weight gradients
loss.backward()
# update the weights
optimizer.step()
# print loss statistics
running_loss += loss.item()
if batch_i % 10 == 9: # print every 10 batches
print('Epoch: {}, Batch: {}, Avg. Loss: {}'.format(epoch + 1, batch_i+1, running_loss/10))
running_loss = 0.0
print('Finished Training')
We set a batch size of 10 and, after multiple iterations, used only 5 epochs once I got access to Colab GPUs. On CPU, even 1 epoch took substantially longer.
# train your network
n_epochs = 5 # start small, and increase when you've decided on your model structure and hyperparams
# this is a Workspaces-specific context manager to keep the connection
# alive while training your model, not part of pytorch
# with active_session():
train_net(n_epochs)
Epoch: 1, Batch: 10, Avg. Loss: 0.15559974163770676
...
Epoch: 1, Batch: 340, Avg. Loss: 0.08508988954126835
Epoch: 2, Batch: 340, Avg. Loss: 0.09279421418905258
Epoch: 3, Batch: 340, Avg. Loss: 0.06953471899032593
Epoch: 4, Batch: 340, Avg. Loss: 0.08672293536365032
Epoch: 5, Batch: 340, Avg. Loss: 0.09620198681950569
Finished Training
(per-batch log truncated; shown are the first entry and the final average loss of each epoch)
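`train_net` is defined earlier in the notebook; for readers skimming this section, here is a minimal sketch of what such a PyTorch training loop typically looks like. The SmoothL1Loss/Adam combination and the every-10-batches logging match what is described in this report, but the function below is a generic pattern, not the notebook's exact implementation:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_net_sketch(net, train_loader, n_epochs):
    """Generic PyTorch training loop; a sketch, not the notebook's exact code."""
    criterion = nn.SmoothL1Loss()                 # robust regression loss for keypoints
    optimizer = optim.Adam(net.parameters(), lr=0.001)
    net.train()                                   # enable dropout during training
    for epoch in range(n_epochs):
        running_loss = 0.0
        for batch_i, sample in enumerate(train_loader):
            images = sample['image'].float()
            # flatten (batch, 68, 2) keypoints into (batch, 136) targets
            key_pts = sample['keypoints'].view(sample['keypoints'].size(0), -1).float()
            optimizer.zero_grad()
            output = net(images)                  # (batch, 136) predicted keypoints
            loss = criterion(output, key_pts)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if (batch_i + 1) % 10 == 0:           # log every 10 batches, as in the output above
                print(f"Epoch: {epoch+1}, Batch: {batch_i+1}, "
                      f"Avg. Loss: {running_loss/10}")
                running_loss = 0.0
    print('Finished Training')
```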
Now I will show the performance of the model on the test dataset that we previously loaded and converted to tensors. The results below compare the predicted facial keypoints with the ground-truth keypoints marked on the faces.
# get a sample of test data again
test_images, test_outputs, gt_pts = net_sample_output()
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.size())
torch.Size([10, 1, 224, 224]) torch.Size([10, 68, 2]) torch.Size([10, 68, 2])
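The network itself emits a flat vector of 136 values per face (see the `fc1` layer printed later, with `out_features=136`); the `(10, 68, 2)` shapes above come from pairing those values into 68 (x, y) coordinates. A quick illustration of that pairing:

```python
import numpy as np

flat = np.arange(136)        # stand-in for one face's 136 predicted values
pts = flat.reshape(68, 2)    # 68 keypoints, each an (x, y) pair
print(pts.shape)             # (68, 2)
print(pts[0], pts[67])       # first and last keypoint pairs
```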
Here we visualize some images from the test set to compare the predicted keypoints with the true keypoints.
# visualize test output
visualize_output(test_images, test_outputs, gt_pts)
Here I saved the trained model's parameters to my directory.
model_name = 'keypoints_model_RG.pt'
# after training, saved my model parameters
torch.save(net.state_dict(), model_name)
Here I extracted a single filter from the first convolutional layer and displayed it as a grayscale grid. We want to visualize which filters are convolved with the images: each channel is convolved with a filter that targets a specific spatial pattern.
# Get the weights in the first conv layer, "conv1"
# if necessary, change this to reflect the name of your first conv layer
weights1 = net.conv1.weight.data
w = weights1.numpy()
filter_index = 0
print(w[filter_index][0])
print(w[filter_index][0].shape)
# display the filter weights
plt.imshow(w[filter_index][0], cmap='gray')
[[-0.11316667  0.05901043  0.06459889  0.03028295  0.09153267]
 [-0.17570217  0.1590816  -0.09632127  0.13132556  0.15489411]
 [ 0.04300174 -0.00570017 -0.18424891 -0.13565046  0.0857215 ]
 [ 0.10661493  0.15387483 -0.02937688  0.17278284 -0.16287693]
 [-0.08857308  0.08667422  0.06126913  0.13169648 -0.10188922]]
(5, 5)
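To see what a learned filter like `w[filter_index][0]` does to an image, it can be cross-correlated with a grayscale input (in practice one would use `cv2.filter2D` or `torch.nn.functional.conv2d`; the plain numpy sketch below just shows the arithmetic a conv layer performs per channel, on a made-up edge image and kernel):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation: what a conv layer computes per channel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # slide the kernel over the image and take the weighted sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# a vertical-edge image and a simple edge kernel as a stand-in for a learned 5x5 filter
img = np.zeros((8, 8))
img[:, 4:] = 1.0
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
response = cross_correlate2d(img, kernel)
print(response.shape)   # (6, 6): 8 - 3 + 1 in each dimension
```

The response is strongest where the window straddles the vertical edge, which is exactly the kind of spatial pattern individual filters learn to pick out.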
Now that I have trained the neural network to detect facial keypoints, I will evaluate its performance on images in the test set. I will also test it on my own images from my gallery.
1. Detect all the faces in an image using a face detector (we'll be using a Haar Cascade detector in this notebook).
2. Pre-process those face images so that they are grayscale and transformed to a Tensor of the input size that the model expects.
3. Use the trained model to detect facial keypoints on the image.
I selected a couple of images from a family event. We'll see how the trained model Net() performs on these images, which come from a completely different distribution. They have noisy backgrounds and multiple faces, allowing a better evaluation of the model.
Step 1: I imported the images as JPEGs
Step 2: Detect faces with Haar cascade classifier
Step 3: Create a bounding box around the face
import cv2
# load in color image for face detection
image = cv2.imread('obamas.jpg')
# switch red and blue color channels
# --> by default OpenCV assumes BLUE comes first, not RED as in many images
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# plot the image
fig = plt.figure(figsize=(9,9))
plt.imshow(image)
# load in a haar cascade classifier for detecting frontal faces
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
# run the detector
# the output here is an array of detections; the corners of each detection box
# if necessary, modify these parameters until you successfully identify every face in a given image
faces = face_cascade.detectMultiScale(image, 1.2, 2)
# make a copy of the original image to plot detections on
image_with_detections = image.copy()
# loop over the detected faces, mark the image where each face is found
for (x, y, w, h) in faces:
    # draw a rectangle around each detected face
    # the rectangle width may need to change depending on image resolution
    cv2.rectangle(image_with_detections, (x, y), (x+w, y+h), (255, 0, 0), 3)
fig = plt.figure(figsize=(9,9))
plt.imshow(image_with_detections)
import torch
from models import Net
net = Net()
net.load_state_dict(torch.load('keypoints_model_RG.pt'))
## print out the net and prepare it for testing
net.eval()
Net( (conv1): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1)) (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1)) (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1)) (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (drop2d): Dropout2d(p=0.2, inplace=False) (drop): Dropout(p=0.2, inplace=False) (fc1): Linear(in_features=86528, out_features=136, bias=True) )
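Calling `net.eval()` matters here because the architecture above includes Dropout and Dropout2d layers; in eval mode they become identity ops instead of randomly zeroing activations. A small numpy illustration of inverted dropout, the scheme PyTorch uses (the helper below is mine, not part of the project):

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Inverted dropout: zero units with probability p and rescale survivors
    during training; pass activations through unchanged in eval mode."""
    if not training:
        return x                          # eval mode: identity, no rescaling needed
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p       # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)           # rescale so the expected value is unchanged

x = np.ones((4, 4))
print(dropout(x, training=False))         # identical to x, as after net.eval()
print(dropout(x, training=True))          # some zeros, survivors scaled to 1.25
```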
Using images from a personal event of my family, I tested the model with the results below:
The pipeline works as follows:
The faces detected by the Haar detector and the faces the network was trained on are of different sizes. We added padding to the detected region because, without it, the model generated keypoints that were too small for the given face.
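One caveat with the fixed `-50/+50` padding used in the crops below: for faces near the image border, a negative slice start wraps around in numpy and returns the wrong region. A sketch of a safer padded crop that clamps to the image bounds (the helper name is mine; the notebook uses raw slicing):

```python
import numpy as np

def padded_roi(image, x, y, w, h, pad=50):
    """Crop a face box expanded by `pad` pixels, clamped to the image bounds."""
    H, W = image.shape[:2]
    top    = max(y - pad, 0)
    bottom = min(y + h + pad, H)
    left   = max(x - pad, 0)
    right  = min(x + w + pad, W)
    return image[top:bottom, left:right]

img = np.arange(100 * 100).reshape(100, 100)
roi = padded_roi(img, x=10, y=10, w=30, h=30, pad=50)  # raw slicing would index -40 here
print(roi.shape)   # (90, 90): clamped at the top-left corner
```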
def show_all_keypoints(image, keypoints):
    """
    Visualize the image with the predicted keypoints plotted on it.
    """
    plt.figure(figsize=(5, 5))
    keypoints = keypoints.data.numpy()
    keypoints = keypoints * 55.0 + 95            # un-normalize keypoints back to pixel coordinates
    keypoints = np.reshape(keypoints, (68, -1))  # reshape flat output into 68 (x, y) pairs
    image = image.numpy()
    image = np.transpose(image, (1, 2, 0))       # convert to numpy image shape (H x W x C)
    image = np.squeeze(image)
    plt.imshow(image, cmap='gray')
    plt.scatter(keypoints[:, 0], keypoints[:, 1], s=40, marker='.', c='m')
image_copy = np.copy(image)
# loop over the detected faces from haar cascade
for (x, y, w, h) in faces:
    # select the region of interest that is the face in the image
    roi = image_copy[y-50:y+h+50, x-50:x+w+50]
    # convert the face region from RGB to grayscale
    roi = cv2.cvtColor(roi, cv2.COLOR_RGB2GRAY)
    # normalize the grayscale image so that its range falls in [0, 1] instead of [0, 255]
    roi = roi / 255
    # rescale the detected face to the square size the CNN expects (224x224)
    roi = cv2.resize(roi, (224, 224))
    # reshape the numpy image (H x W) into a torch image shape (batch x C x H x W)
    roi = np.reshape(roi, (1, 1, 224, 224))
    roi_tensor = torch.from_numpy(roi)
    roi_tensor = roi_tensor.type(torch.FloatTensor)
    # make facial keypoint predictions using the loaded, trained network
    keypoints = net(roi_tensor)
    # display each detected face and the corresponding keypoints
    show_all_keypoints(roi_tensor.squeeze(0), keypoints)
image = cv2.imread('ZMJ00269-596.jpg')
# switch red and blue color channels
# --> by default OpenCV assumes BLUE comes first, not RED as in many images
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# plot the image
fig = plt.figure(figsize=(9,9))
plt.imshow(image)
# load in a haar cascade classifier for detecting frontal faces
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
# run the detector
# the output here is an array of detections; the corners of each detection box
# if necessary, modify these parameters until you successfully identify every face in a given image
faces = face_cascade.detectMultiScale(image, 1.2, 2)
# make a copy of the original image to plot detections on
image_with_detections = image.copy()
# loop over the detected faces, mark the image where each face is found
for (x, y, w, h) in faces:
    # draw a rectangle around each detected face
    # the rectangle width may need to change depending on image resolution
    cv2.rectangle(image_with_detections, (x, y), (x+w, y+h), (255, 0, 0), 3)
fig = plt.figure(figsize=(9,9))
plt.imshow(image_with_detections)
import torch
from models import Net
net = Net()
net.load_state_dict(torch.load('keypoints_model_RG.pt'))
image_copy = np.copy(image)
# loop over the detected faces from haar cascade
for (x, y, w, h) in faces:
    # select the region of interest that is the face in the image
    roi = image_copy[y-50:y+h+50, x-50:x+w+50]
    # convert the face region from RGB to grayscale
    roi = cv2.cvtColor(roi, cv2.COLOR_RGB2GRAY)
    # normalize the grayscale image so that its range falls in [0, 1] instead of [0, 255]
    roi = roi / 255
    # rescale the detected face to the square size the CNN expects (224x224)
    roi = cv2.resize(roi, (224, 224))
    # reshape the numpy image (H x W) into a torch image shape (batch x C x H x W)
    roi = np.reshape(roi, (1, 1, 224, 224))
    roi_tensor = torch.from_numpy(roi)
    roi_tensor = roi_tensor.type(torch.FloatTensor)
    # make facial keypoint predictions using the loaded, trained network
    keypoints = net(roi_tensor)
    # display each detected face and the corresponding keypoints
    show_all_keypoints(roi_tensor.squeeze(0), keypoints)
image = cv2.imread('ZMJ00063(1)-486.jpg')
# switch red and blue color channels
# --> by default OpenCV assumes BLUE comes first, not RED as in many images
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# plot the image
fig = plt.figure(figsize=(9,9))
plt.imshow(image)
# load in a haar cascade classifier for detecting frontal faces
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
# run the detector
# the output here is an array of detections; the corners of each detection box
# if necessary, modify these parameters until you successfully identify every face in a given image
faces = face_cascade.detectMultiScale(image, 1.2, 2)
# make a copy of the original image to plot detections on
image_with_detections = image.copy()
# loop over the detected faces, mark the image where each face is found
for (x, y, w, h) in faces:
    # draw a rectangle around each detected face
    # the rectangle width may need to change depending on image resolution
    cv2.rectangle(image_with_detections, (x, y), (x+w, y+h), (255, 0, 0), 3)
fig = plt.figure(figsize=(9,9))
plt.imshow(image_with_detections)
import torch
from models import Net
net = Net()
net.load_state_dict(torch.load('keypoints_model_RG.pt'))
image_copy = np.copy(image)
# loop over the detected faces from haar cascade
for (x, y, w, h) in faces:
    # select the region of interest that is the face in the image
    roi = image_copy[y-50:y+h+50, x-50:x+w+50]
    # convert the face region from RGB to grayscale
    roi = cv2.cvtColor(roi, cv2.COLOR_RGB2GRAY)
    # normalize the grayscale image so that its range falls in [0, 1] instead of [0, 255]
    roi = roi / 255
    # rescale the detected face to the square size the CNN expects (224x224)
    roi = cv2.resize(roi, (224, 224))
    # reshape the numpy image (H x W) into a torch image shape (batch x C x H x W)
    roi = np.reshape(roi, (1, 1, 224, 224))
    roi_tensor = torch.from_numpy(roi)
    roi_tensor = roi_tensor.type(torch.FloatTensor)
    # make facial keypoint predictions using the loaded, trained network
    keypoints = net(roi_tensor)
    # display each detected face and the corresponding keypoints
    show_all_keypoints(roi_tensor.squeeze(0), keypoints)
In the first image, the background noise is low, so the Haar cascade detects the face accurately and ignores the background.
In the second image, of the Obamas, both faces are detected and other noise is ignored. The facial keypoints land in the correct region, but they are spread out across the face; the keypoints essentially trace the facial outline rather than the individual features.
Finally, in the third image, the face is detected, but so is some of the background: the background is very busy, and some of its patches resemble the pixel patterns of generic faces as detected by Haar cascades.
The combination of Smooth L1 loss and the Adam optimizer is well suited to keypoint detection tasks:
Smooth L1 loss is robust to outliers and helps handle noisy keypoint annotations.
The Adam optimizer adjusts learning rates adaptively, leading to faster convergence and better handling of complex loss landscapes.
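The robustness claim follows directly from the Smooth L1 formula: the loss is quadratic within |error| < beta and linear beyond it, so a single bad annotation contributes far less than under squared error. A numpy sketch of the element-wise formula (intended to match torch.nn.SmoothL1Loss with its default beta=1.0):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise Smooth L1 (Huber-like) loss, averaged over elements."""
    d = np.abs(pred - target)
    # quadratic branch near zero, linear branch for large errors
    loss = np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta)
    return loss.mean()

pred   = np.array([0.1, 0.2, 5.0])   # last value is an outlier prediction
target = np.zeros(3)
print(smooth_l1(pred, target))        # outlier contributes 4.5, not the 12.5 of 0.5*d^2
```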
My original architecture consisted of two blocks of two convolutional layers each, an interstitial max pooling layer, and an output stage of max pooling followed by fully connected layers. Because that model was too big for this workspace, I switched to a more straightforward model that alternates convolutional and max pooling layers, with a smaller fully connected layer at the end. Both 2D dropout and regular dropout were included after the first convolutional layer and before the fully connected layer.
I tried training the model with a higher number of epochs, but the workspace gave an error that the data loader was killed. I was also training on the CPU, so I managed a maximum of 2 epochs before it stopped working.
However, I have used Google Colab for the labs of this course, which provided me with GPUs. There, I iterated and achieved better performance and lower loss with 5 epochs. Increasing the batch size caused a similar memory overflow, so I decided to stick with the provided value of 10.
Most importantly, the results seemed to be decent with these values.
This project is highly dependent on the type of normalization we perform on our images, since each image has a different size and parameters. One challenge in this project was therefore to understand the required normalization and to program it. Moreover, because of computation limits, the training process had to be restricted: the maximum I could run was 5 epochs at a batch size of 10, given the size of the dataset. The results are not fully accurate, and this computational limitation is part of the reason. However, the project serves its purpose of identifying the patterns of facial features in images. Haar cascades identify the likely face regions in an image, and the facial keypoints are then predicted using our trained Convolutional Neural Network model.